Lexicon Reduction for Urdu/Arabic Script Based Character Recognition: A Multilingual OCR
Abstract
Arabic script character recognition is a challenging task due to the complexity of the script and the huge number of ligatures. We present a method for developing a multilingual Arabic script OCR (Optical Character Recognition) system and for lexicon reduction for Arabic script and its derivative languages. The objective of the proposed method is to overcome the large dataset required for Urdu and similar scripts by using the GCT (Ghost Character Theory) concept. Arabic and its sibling script languages share a similar character set, i.e. the character sets differ in diacritics and in writing styles such as Naskh or Nasta’liq. Based on the proposed method, the lexicon for Arabic and Arabic script based languages can be reduced by a factor of up to approximately 20. The proposed multilingual Arabic script OCR approach has been evaluated for online Arabic and its derivative languages such as Urdu using a BPNN. The results show that the proposed method not only reduces the lexicon but also helps in developing a multilanguage character recognition system for Arabic script.
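The idea of sharing base shapes across letters that differ only in diacritics can be illustrated with a minimal Python sketch. It is not the paper's implementation: the table `GHOST_TABLE`, the shape labels, and the helper names `decompose` and `reduction_ratio` are hypothetical and cover only a handful of isolated letters. The ratio it prints applies to this toy table only and is unrelated to the roughly 20-fold ligature-level reduction reported in the abstract.

```python
# Hypothetical, partial table: letter -> (ghost shape label, diacritic label).
# Letters in the same group share one base shape and differ only in marks.
GHOST_TABLE = {
    "ب": ("BEH", "1 dot below"),
    "پ": ("BEH", "3 dots below"),
    "ت": ("BEH", "2 dots above"),
    "ٹ": ("BEH", "small tah above"),
    "ث": ("BEH", "3 dots above"),
    "ج": ("HAH", "1 dot below"),
    "چ": ("HAH", "3 dots below"),
    "ح": ("HAH", "none"),
    "خ": ("HAH", "1 dot above"),
    "د": ("DAL", "none"),
    "ڈ": ("DAL", "small tah above"),
    "ذ": ("DAL", "1 dot above"),
    "ر": ("REH", "none"),
    "ڑ": ("REH", "small tah above"),
    "ز": ("REH", "1 dot above"),
    "ژ": ("REH", "3 dots above"),
    "س": ("SEEN", "none"),
    "ش": ("SEEN", "3 dots above"),
    "ص": ("SAD", "none"),
    "ض": ("SAD", "1 dot above"),
    "ط": ("TAH", "none"),
    "ظ": ("TAH", "1 dot above"),
    "ع": ("AIN", "none"),
    "غ": ("AIN", "1 dot above"),
}


def decompose(text):
    """Split text into a ghost-shape sequence and a diacritic sequence.

    Characters missing from the table are kept as their own shape.
    """
    ghosts, marks = [], []
    for ch in text:
        ghost, mark = GHOST_TABLE.get(ch, (ch, "none"))
        ghosts.append(ghost)
        marks.append(mark)
    return tuple(ghosts), tuple(marks)


def reduction_ratio(table):
    """Number of letters divided by number of distinct ghost shapes."""
    shapes = {ghost for ghost, _ in table.values()}
    return len(table) / len(shapes)


if __name__ == "__main__":
    # Two different letter pairs collapse to the same ghost sequence.
    print(decompose("بت")[0] == decompose("پث")[0])
    print(reduction_ratio(GHOST_TABLE))
```

In a design along these lines, a shape classifier would only need to distinguish the ghost classes, and a separate step would recover the actual letter from the recognized diacritics and the target language.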
Related resources
A segmentation-free approach to Arabic and Urdu OCR
In this paper, we present a generic Optical Character Recognition system for Arabic script languages called Nabocr. Nabocr uses OCR approaches specific for Arabic script recognition. Performing recognition on Arabic script text is relatively more difficult than Latin text due to the nature of Arabic script, which is cursive and context sensitive. Moreover, Arabic script has different writing st...
Segmentation of Nastaliq Script for OCR
In this paper we present a novel segmentation technique for implementing an OCR (Optical Character Recognition) system for printed Nastalique text, a calligraphic style of Urdu which uses the Arabic script for its writing. OCR systems for many of the world's major languages have been developed and are in use, but at present an OCR for Nastalique is not available and the published research on ...
Recognition of Urdu Character with Hmm Technique
This paper deals with an Optical Character Recognition system for printed Urdu, a popular Pakistani/Indian script and the third most widely understood language in the world, especially in the subcontinent, yet few efforts have been made to make it understandable to computers. A lot of work has been done in the field of literature and Islamic studies in Urdu, which has to be computerized. Research h...
Unconstrained OCR for Urdu using Deep CNN-RNN Hybrid Networks
Building robust text recognition systems for languages with cursive scripts like Urdu has always been challenging. Intricacies of the script and the absence of ample annotated data further act as adversaries to this task. We demonstrate the effectiveness of an end-to-end trainable hybrid CNN-RNN architecture in recognizing Urdu text from printed documents, typically known as Urdu OCR. The solut...
Diacritics Recognition Based Urdu Nastalique OCR System
Improvements and new developments in the field of Artificial Intelligence have opened new horizons in the advancement of machines that originally had limited intelligence. Compared to the human brain, machines already have better computational speed and storage; however, there is still much room to improve their capability to acquire and process data and draw conclusions from it on their own. Optical...